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Abstract 

High-density DNA arrays, used to monitor gene expression at a genomic scale, have 
produced vast amounts of information which require the development of efficient 
computational methods to analyze them. The important first step is to extract 
the fundamental patterns of gene expression inherent in the data. This paper de- 
scribes the application of a novel clustering algorithm, Super-Paramagnetic Cluster- 
ing (SPC) to analysis of gene expression profiles that were generated recently during 
a study of the yeast cell cycle. SPC was used to organize genes into biologically rel- 
evant clusters that are suggestive for their co-regulation. Some of the advantages of 
SPC are its robustness against noise and initialization, a clear signature of cluster 
formation and splitting, and an unsupervised self-organized determination of the 
number of clusters at each resolution. Our analysis revealed interesting correlated 
behavior of several groups of genes which has not been previously identified. 



1 Introduction 



DNA microarray technologies have made it straightforward to monitor simul- 
taneously the expression levels of thousands of genes during various cellular 
processes [1] [2] . The new challenge is to make sense of such massive expression 
data [3]. In most such experiments, investigators compare the relative change 
of gene expression levels between two samples (one is called the target, such 
as a disease sample; the other is called the control, such as a normal sample). 
In a typical experiment simultaneous expression levels of thousands of genes 
are viewed over a few tens of time-points (or different tissues [4]). Hence one 
needs to analyse arrays that contain 10 5 — 10 6 measurements. 

The aims of such analysis are typically to (a) group genes with correlated 
expression profiles; (b) Focus on those groups which seem to participate in 
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some biological process; (c) Provide a biological interpretation of the clusters. 
Interpretations could be co-regulation of the mean cluster expression with a 
known process, a promoter common to most of the genes in the cluster, etc. 
(d) In experiments that compare data from different tissues (such as tumor 
and normal [4]) one also tries to differentiate them on the basis of their genetic 
expression profiles. 

The sizes of the datasets and their complexity call for multi-variant cluster- 
ing techniques which are essential for extracting correlated patterns in the 
swarm of data points in multidimensional space (for example, each relative 
gene expression profile with k time-points may be regarded as a k-dimensional 
vector). 



2 SPC 

Currently, two clustering appoaches are very popular among biologists. One 
is average linkage, a hierarchical clustering method [5], with the Pearson cor- 
relation used as a similarity measure [6]. The other is self-organizing maps 
(SOMs) [7], whose most popular implementation for array data analysis is 
GENECLUSTER [8]. 

We present here clustering performed by SPC, a hierarchical clustering method 
recently introduced by Blatt et al [9] . It is based on an analogy to the physics 
of inhomogeneous ferromagnets. Full details of the algorithm [10] and the 
underlying philosophy [11] are given elsewhere ; here only a brief description 
is provided. 

The input required for SPC is a distance matrix between the iV data points 
that are to be clustered. From such a distance matrix one constructs a graph, 
whose vertices are the data points and edges identify neighboring points. Two 
points i and j are called neighbors (and connected by an edge) if they satisfy 
the K-mutual-neighbor criterion, i.e. iff j is one of the K nearest points to i 
and vice versus. With each edge we associate a weight Jy > 0, which decreases 
as the distance between points % and j increases. 

Assignement of the datapoints to clusters is equivalent to partitioning this 
weighted graph. Cluster indices play the role of the states of Potts spins as- 
signed to each vertex (i.e. to each original data point). Two neighboring spins 
are interacting ferromagnetically with strength J^. This Potts ferromagnet is 
simulated at a sequence of temperatures T. The susceptibility and the correla- 
tion function for neighboring pairs of spins are measured. The pair correlation 
function serves to identify clusters: high correlation means that the two data 
points belong to the same cluster. 
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The temperature T controls the resolution at which clustering is performed; 
the algorithm finds typical clusters at all resolutions. At very low temperatures 
all points belong to a single cluster and as T is increased, clusters break 
into smaller ones until at high enough temperatures each point forms its own 
cluster. The clusters found at all temperatures form a dendrogram. Blatt et 
al showed that the SPC algorithm is robust since the clusters are formed due 
to collective behavior of the system. The major splits can be easily identified 
by a peak in the susceptibility. For more details see [9-1 1] . 



3 Yeast Cell Cycle and Microarray Data 

We applied SPC on a recently published data set [14] to determine whether 
it could automatically expose known clusters without using prior knowledge. 
Eisen et al [6] clustered the genes on the basis of data combined from 7 dif- 
ferent experiments. We suspected that mixing the results of different experi- 
ments may introduce noise into the data associated with a single one. There- 
fore we chose to use only a single time course, that of gene expression as 
measured in a single process (cell division cycles following alpha-factor block- 
and- release [12]). Furthermore, we focused on genes that have characterized 
functions (2467 genes) for easier interpretation. 

Genetic controls and regulation play a central role in determination of cell 
fate during development. They are also important for the timing of cell cycle 
events such as DNA replication, mitosis and cytokinesis. Yeast is a single cel- 
lular organism, which has become a favorite model in molecular biology due 
to the easiness of genetic and biochemical manipulation and the availability of 
the complete genome. Like all living cells, the yeast cell cycle consists of four 
phases: Gl— >S— >G2— »M— >G1..., where S is the phase of DNA synthesis (repli- 
cating the genome); M stands for mitosis (division into two daughter cells), 
and the two gap phases are called Gl (preceding the S phase) and G2 (follow- 
ing the S phase). At least four different classes of cell cycle regulated genes 
exist in yeast [13]: Gl cyclins and DNA synthesis genes are expressed in late 
Gl; histone genes in S; genes for transcription factors, cell cycle regulators and 
replication initiation proteins in G2; and genes needed for cell separation are 
expressed as cells enter Gl. Early and late Gl-specific transcription is medi- 
ated by the Swi5/Ace2 and Swi4/Swi6 classes of factors, respectively. Changes 
in the master cyclin/Cdc28 kinases are involved in all classes of regulation. 

In the alpha-factor block-release experiments, MATa cells were first arrested in 
Gl by using alpha pheromone. Then the blocker was removed; from this point 
on the cell division cycle starts and the population progresses with significant 
cell cycle synchrony. RNA was extracted from the synchronized sample, as 
well as a control sample (asynchronous cultures of the same cells growning 
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exponentially at the same temperature in the same medium). 

Fluorescently labeled cDNA was synthesized using Cy3 ("green") for the con- 
trol and Cy5 ("red") for the target. Mixtures of equal amounts of the two 
samples were taken at every 7min and competitively hybridized to individual 
microarrays containing essentially all yeast genes. The ratio of red to green 
light intensity (proportional to the RNA concentrations) was measured by 
scanning laser microscopy (See [12] for experimental details). The actual data 
provided at the Stanford website [14] is the log ratios. 

In the their analysis, Spellman et al. were focusing on identification of 800 cell 
cycle regulated genes (that may have periodic expression profiles). In our test 
of SPC, in addition to oscillatory genes we were also looking for any groups 
of genes with highly correlated expression patterns. 



4 SPC Analysis of Yeast Gene Expression Profiles 

We clustered the expression profiles of the 2467 yeast genes of known func- 
tion over data taken at 18 time intervals (of 7 min) during two cell division 
cycles, synchronised by alpha arrest and release. Denote by the relative 
expression of gene i at time interval j. Our data consist of 2467 points in an 
18-dimensional space, normalised in the standard way: 

n — E ij -<E i > _ jp . _ 1 v^18 W ■ rr 2 — 1 X^ 18 P 2 P \2 

We looked for clusters of genes with correlated expression profiles over the 
two division cycles. The SPC algorithm was used with q = 20 component 
Potts spins, each interacting with those neighbors that satisfy the K-mutual 
neighbor criterion [10] with K = 10. Euclidean distance between the normal- 
ized vectors was used as the distance between two genes. This distance is 
proportional to the Pearson correlation used by Eisen et. ai. 

At T = all datapoints form one cluster, which splits as the system is 
"heated". The resulting dendrogram of genes is presented in Fig. 1. Each node 
represents a cluster; only clusters of size larger than 6 genes are shown. The 
last such clusters of each branch, as well as non-terminal clusters that were 
selected for presentation and analysis (in a way described below) are shown 
as boxes. The circled boxes represent the clusters that are analysed below. 

The position of every node along the horizontal axis is determined for the 
corresponding cluster according to a method introduced by Alon et al [4]; 
proximity of two clusters along this axis indicates that the corresponding tem- 
poral expression profiles are not very different. The vertical axis represents the 
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resolution, controlled by the "temperature" T > 0. The vertical position of a 
node or box is determined by the value of T at which it splits. A high verti- 
cal position indicates that the cluster is stable, i.e. contains a fair number of 
closely-spaced data points (genes with similar expression profiles). 

In order to identify clusters of genes whose temporal variation is on the scale 
of the cell cycle, we calculated for each cluster a cycle score S±, defined as 
follows. First, for each cluster C (with Nc genes) we calculate the average 
normalized expression level at all j = 1, 18 time intervals and the corre- 
sponding standard deviations cr c (j): 

G C \j) = ± E G« [a C {j)\ 2 = ^ E(^) 2 " \G c (j)] 2 



Next, we evaluated the Fourier transform of the mean expression profiles G (j) 
for every gene cluster C. To suppress initial transients, the Fourier transform 
is performed only over j = 4, 18. Denote the absolute values of the Fourier 
coefficients by A k ; the ratio between low- frequency coefficients and the high- 
frequency ones was used as a figure of merit for the time scale of the variation. 
We observed that clusters that satisfy the condition 

s? = PHr > 2 - 15 W 



have the desired time dependence, and found 29 clusters (consisting of 167 
genes) to have such scores. For many of these clusters, however, the temporal 
variation was very weak, i.e. of the same order as the standard deviations 
<J C \j) of the individual gene expressions of the cluster. We defined another 
score, S% , for which we required 
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For clusters C that satisfy this condition the "signal" significantly exceeds the 
noise. We select a cluster if its score exceeds 5.6, while its parent's score does 
not. Only 4 clusters, containing 86 genes, satisfy both conditions (1) and (2); 
these are numbered 1 - 4 on Fig. 1. Seven additional relatively stable clusters 
which did not satisfy our two criteria, but are of interest, are also selected and 
circled on figure 1. 

The corresponding time sequences are shown in Fig 2: G (j) is plotted for each 
cluster versus time j, with the error bars representing the standard deviations 
cr c (j). Clusters 1,2 and 4 clearly corresponds to the cell cycle. 
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Fig. 1. Dendrogram of genes. Clusters 1-4 were selected according to our criteria, 
eq. (1-2). The other circled and numbered clusters are also interesting (see text). 



5 Details and Interpretation of gene clustering. 



The full lists of genes that constitute the 11 selected clusters are given in our 
website [15] . We present here a short analysis of our clusters. We use standard 
notation for bases: R stands for A or G, W for A or T, K for G or T, N for 
any base. 

Cluster # 1: These are mostly Late Gl phase specific genes. They con- 
tain the major cell cycle regulators: Clnl,2, Clb5,6 and Swi4 as well as DNA 
replication and repair genes. One can easily detect MCB (ACGCGT) or SCB 
(CRCGAAA) sites in their promoters to which MBF (Swi6p+Mbplp) and 
SBF (Swi6p+Swi4p) bind respectively [13]. 

Cluster ^ 4: This cluster contains mostly S phase genes and is dominated by 
the histones. Histones are required for wrapping up nascent DNA into nucle- 
osomes, their promoters are regulated by CCA (GCGAARYTNGRGAACR) , 
NEG (CATTGNGCG) as well as SCB (CGCGAAA) [3]. 

Cluster ^ 2: These are mostly G2/M phase genes. They contain the ma- 
jor cell cycle regulators: Clbl,2 and Swi5/Ace2. It is known that all genes 
co-regulated with Obi, 2 are mainly controlled by either Mcml at P-box 
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Fig. 2. Mean normalized expression of selected clusters, versus time, measured at 
intervals of 7 minutes. Error bars represent the standard deviations a (j)- N c is the 
number of genes in each cluster. The clusters are numbered as in figure 1. 

(TTWCCYAAWNNGGWAA) or by Mcml+SFF through the composite site: 
(P-box)N2-4RTAAAYAA [12] [3]. 

Clusters # 5, # 6 and # 8: These are mostly ribosomal protein (RP) 
genes. The genome of Saccharomyces cerevisiae contains 137 genes coding for 
ribosomal proteins [16]. Since 59 genes are duplicated, the ribosomal gene fam- 
ily encodes 78 different proteins, 32 of the small and 46 of the large ribosomal 
subunit. They are co-regulated because they are sub-components of ribosome 
machinery for protein translation. All genes in cluster #6 reside on chromo- 
somes 2, 4 and 5, except rplllb which resides on chromosome 9. All genes 
in clusters ^5 and #8 (which are very close in the dendrogram of Figure 1) 
reside on chromosomes 8-16, except rpsl7b which resides on chromosome 4. 
It is likely that the expression of these ribosomal genes is correlated to their 
chromosomal locations. It is interesting that the expression profiles appear to 



have pronounced oscillations (throughout in #5, at early times in #6 and late 
times in #8). Like most of the RP genes, the ribosomal genes in the 3 clusters 
also contain multiple global Regulator Raplp binding sites in their promoters 
within a preferred window of 15-26 bp [17]. The transcription of most RP 
genes is activated by two Raplp binding sites, 250 to 400 bp upstream from 
the initiation of transcription. Since Raplp can be both an activator and a 
silencer, it is not known whether Raplp is responsible for the oscillation. This 
oscillation could be a result of interplay between cell cycle and Raplp activity 
which determines the mean half life of the RP mRNAs (5-7min, [18]). As fresh 
medium was added at 91min during the alpha-factor experiments, the genes 
in #6 and in #8 may have different responses to the nutrient change. 

Cluster ^7: This cluster has 42 genes that are largely not cell cycle regu- 
lated. These genes have diverse functions in general metabolism. When search- 
ing promoter regions for regulatory elements using gibbsDNA [20], a highly 
conserved motif GCGATGAGNT is shared by 90 % of genes. This element 
seems to be novel, it has some similarity to Gcn4p site TGACTC and Yaplp 
site GCTGACTAATT [19]. When searching the yeast promoter database - 
SCPD [21], we found that the BUF site in the HO gene promoter and the 
UASPHR site in the Rad50 promoter appear to contain the core motif GAT- 
GAG. Although we do not know if this element is functional or what might be 
the trans-factor, it is still very likely that it may contribute the co-regulation 
of this cluster of genes. 

Cluster ^10: This cluster is characterized by a pronounced dip towards the 
end of the profile. They are not cell cycle regulated by and large, except Clb4 
(a S/G2 cyclin ) and Rad54 (a Gl DNA repair gene). By searching promoter 
elements, we found a conserved motif RNNGCWGCNNC that is shared by a 
subset of the genes (Clb4, YNL252C, Rad54, RpblO, Atpll and Pexl3). It 
partially matches a PH04 binding motif (TCGGGCCACGTGCAGCGAT) in 
the promoter of Pho8. However, the PH04 consensus, CACGTK, does not 
appear in the conserved motif of our cluster. Therefore we suspect that it is a 
novel motif which should be tested by experiments. 



6 Summary 

We used the SPC algorithm to cluster gene expression data for the yeast 
genome. We were able to identify groups of genes with highly correlated tem- 
poral variation. Three of the groups found clearly correspond to well known 
phases of the cell cycle; some of our observations of other clusters reveal fea- 
tures that have not been identified previously and may serve as the basis of 
future experimental investigations. 
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